Computer-assisted pronunciation training—Speech synthesis is almost all you need

Authors

Abstract

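The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers have focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect pronunciation errors with high accuracy (only 60% precision at 40%–80% recall). One of the key problems is the low availability of mispronounced speech that is needed for the reliable training of pronunciation error detection models. If we had a generative model that could mimic non-native speech and produce any amount of training data, then the task of detecting pronunciation errors would be much easier. We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion to generate correctly pronounced and mispronounced synthetic speech. We show that these techniques not only improve the accuracy of machine learning models for detecting pronunciation errors, but also help establish a new state-of-the-art in the field. Earlier studies have used simple speech generation techniques such as P2P conversion, but only as an additional mechanism to improve the accuracy of pronunciation error detection. We, on the other hand, consider speech generation to be the first-class method of detecting pronunciation errors. The effectiveness of these techniques is assessed in the tasks of detecting pronunciation and lexical stress errors. Non-native English speech corpora of German, Italian, and Polish speakers are used in the evaluations. The best proposed S2S technique improves the accuracy of detecting pronunciation errors in the AUC metric by 41%, from 0.528 to 0.749, compared to the state-of-the-art approach.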
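The idea of generating mispronounced training data can be illustrated at the phoneme level. The following is only a minimal sketch of a P2P-style perturbation, not the authors' implementation: the phoneme inventory `PHONEMES`, the function `perturb_phonemes`, and the `error_rate` parameter are all hypothetical names chosen for illustration, and only random substitutions are modelled. The T2S and S2S techniques named in the abstract operate on audio and are not covered here.

```python
import random

# Hypothetical, ARPAbet-like phoneme inventory (an assumption for illustration).
PHONEMES = ["AA", "AE", "AH", "B", "D", "IH", "IY", "K", "L", "M", "N", "R", "S", "T", "W"]


def perturb_phonemes(canonical, error_rate=0.2, rng=None):
    """Randomly substitute phonemes to simulate mispronunciations.

    Returns (perturbed, labels), where labels[i] is 1 if canonical[i]
    was replaced by a different phoneme and 0 otherwise.
    """
    if rng is None:
        rng = random.Random()
    perturbed, labels = [], []
    for phoneme in canonical:
        if rng.random() < error_rate:
            # Pick any phoneme other than the canonical one.
            choices = [p for p in PHONEMES if p != phoneme]
            perturbed.append(rng.choice(choices))
            labels.append(1)
        else:
            perturbed.append(phoneme)
            labels.append(0)
    return perturbed, labels


# Example: perturb the canonical phonemes of "cat" with a seeded generator.
canonical = ["K", "AE", "T"]
perturbed, labels = perturb_phonemes(canonical, error_rate=0.5, rng=random.Random(0))
print(perturbed, labels)
```

Each perturbed sequence comes with per-phoneme error labels for free, which is what makes synthetic data of this kind usable for training an error detector when real mispronounced speech is scarce.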

Similar resources

All You Need Is Mentorship

I find it humbling to confess that most of the truly original ideas that have driven my research group’s agenda over four decades of time have come, not from my own brain, but instead from the minds of my trainees, both graduate students and post-docs. This on its own might explain why I, rather selfishly, have given them long leashes, allowing them to strike out on their own and craft their ow...

All You Need Is Compassion

The paper presents a new deductive rule for verifying response properties under the assumption of compassion (strong fairness) requirements. It improves on previous rules in that the premises of the new rule are all first order. We prove that the rule is sound, and present a constructive completeness proof for the case of finite-state systems. For the general case, we present a sketch of a rela...

CNN Is All You Need

CNNs have been successfully used in audio, image and text classification, analysis and generation [12,17,18], whereas the RNNs with LSTM cells [5,6] have been widely adopted for solving sequence transduction problems such as language modeling and machine translation [19,3,5]. The RNN models typically align the element positions of the input and output sequences to steps in computation time for ...

Attention is All you Need

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. E...

Journal

Journal title: Speech Communication

Year: 2022

ISSN: 1872-7182, 0167-6393

DOI: https://doi.org/10.1016/j.specom.2022.06.003